TEXAS-CRASHES Analysis

Centered Walkability with Quadratic Term

Author

Eric Delmelle | Lehigh University

Published

October 2, 2025

Introduction

This analysis examines the relationship between walkability and traffic crash counts across Texas census tracts, stratified by Rural-Urban Commuting Area (RUCA) classifications.

Key objectives:

  • Test whether the walkability-crash relationship varies between metropolitan, micropolitan, small rural, and isolated rural areas
  • Account for population density and demographic composition (percent white, percent in poverty)
  • Use population as an exposure offset
  • Include both linear and quadratic walkability terms to capture potential non-linear relationships

1. Setup and Data Loading

1.1 Required Packages

Code
required_packages <- c("MASS", "dplyr", "broom", "ggplot2", "sf", "readr", 
                       "purrr", "knitr", "kableExtra", "ggeffects", "spdep", 
                       "tigris")

missing_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(missing_packages)) {
  install.packages(missing_packages)
}

suppressPackageStartupMessages({
  library(MASS)
  library(dplyr)
  library(broom)
  library(ggplot2)
  library(sf)
  library(readr)
  library(purrr)
  library(knitr)
  library(kableExtra)
  library(ggeffects)
  library(spdep)
  library(tigris)
})
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'sf' was built under R version 4.3.3
Warning: package 'ggeffects' was built under R version 4.3.3
Warning: package 'tigris' was built under R version 4.3.2

1.2 Load Spatial and Crash Data

Load crash data and geographic data, then join based on GEOID.

Code
shape_data <- st_read("data.shp", quiet = TRUE) %>% 
  mutate(GEOID = as.character(GEOID))

crash_counts <- read_csv("countcrashes.csv", 
                        col_types = cols(Count_crash = col_double(), 
                                       GEOID = col_character()))

full_data <- left_join(shape_data, crash_counts, by = "GEOID")

2. Data Preprocessing

Filter invalid observations and create analysis variables:

  • Population density capped at 99th percentile to reduce outlier influence
  • Demographic variables standardized for interpretation
  • Walkability index centered (not scaled) to preserve original units while facilitating quadratic term interpretation
Code
f_model <- full_data %>%
  st_drop_geometry() %>%
  filter(
    Population > 0,
    Count_crash >= 0,
    !is.na(RUCA_2),
    LA_sqmi > 0,
    !is.na(NatWalkInd),
    !is.na(Perc_White),
    !is.na(PercPOV)
  ) %>%
  mutate(
    popden = pmin(Population / LA_sqmi, quantile(Population / LA_sqmi, 0.99, na.rm = TRUE)),
    pct_white = scale(Perc_White / 100)[,1],
    pct_pov = scale(pmin(PercPOV / 100, 0.8))[,1],
    log_offset = log(Population),
    NatWalkInd_c = scale(NatWalkInd, scale = FALSE)[,1],
    RUCA_group = case_when(
      RUCA_2 %in% c(1.0, 1.1, 2.0, 2.1, 3.0, 4.1, 5.1, 7.1, 8.1, 10.1) ~ "Metropolitan",
      RUCA_2 %in% c(4.0, 4.2, 5.0, 5.2, 6.0, 6.1) ~ "Micropolitan", 
      RUCA_2 %in% c(7.0, 7.2, 7.3, 7.4, 8.0, 8.2, 8.3, 8.4, 9.0, 9.1, 9.2) ~ "Small Rural",
      RUCA_2 %in% c(10.0, 10.2, 10.3, 10.4, 10.5, 10.6) ~ "Isolated Rural",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(
    !is.na(RUCA_group),
    across(c(NatWalkInd_c, popden, pct_white, pct_pov, log_offset), 
           ~ is.finite(.) & !is.na(.))
  )
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.

3. Model Fitting

3.1 Fitting Strategy

Fit negative binomial regression models separately for each RUCA group. If convergence fails (typically due to extreme zero-inflation), fall back to Poisson.

Code
fit_model <- function(df) {
  # Try negative binomial first
  model <- tryCatch(
    glm.nb(Count_crash ~ NatWalkInd_c + I(NatWalkInd_c^2) + 
           popden + pct_white + pct_pov + offset(log_offset), data = df),
    error = function(e) NULL
  )
  
  # Fall back to Poisson if needed
  if (is.null(model) || !model$converged) {
    model <- tryCatch(
      glm(Count_crash ~ NatWalkInd_c + I(NatWalkInd_c^2) + 
          popden + pct_white + pct_pov + offset(log_offset), 
          data = df, family = poisson),
      error = function(e) NULL
    )
  }
  
  return(model)
}

# Fit models for each RUCA group
models <- f_model %>%
  group_split(RUCA_group) %>%
  set_names(sort(unique(f_model$RUCA_group))) %>%
  map(fit_model) %>%
  compact()
Warning: glm.fit: fitted rates numerically 0 occurred
Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: algorithm did not converge
Warning in glm.nb(Count_crash ~ NatWalkInd_c + I(NatWalkInd_c^2) + popden + :
alternation limit reached
Warning: glm.fit: fitted rates numerically 0 occurred
Code
# Display model summary
cat("Model Results:\n")
Model Results:
Code
for (name in names(models)) {
  model <- models[[name]]
  model_type <- ifelse("negbin" %in% class(model), "NegBin", "Poisson")
  n_obs <- length(fitted(model))
  aic <- round(AIC(model), 1)
  cat(sprintf("  %s: %s, n=%d, AIC=%.1f\n", name, model_type, n_obs, aic))
}
  Isolated Rural: Poisson, n=356, AIC=843.0
  Metropolitan: NegBin, n=13420, AIC=36237.9
  Micropolitan: NegBin, n=1279, AIC=2375.5
  Small Rural: NegBin, n=729, AIC=966.0

Model Summary:

  • Isolated Rural required Poisson fallback (89% zero crashes)
  • All other groups successfully fitted negative binomial models
  • Metropolitan has largest sample (n=13,420) and best fit

4. Spatial Autocorrelation Analysis

4.1 Testing for Spatial Dependence

Spatial autocorrelation in residuals violates independence assumptions and can bias standard errors. We test using Moran’s I and Geary’s C with queen contiguity weights.

Code
# Download Texas county boundaries from Census Bureau (cached for reuse)
tx_counties <- counties(state = "TX", cb = TRUE, progress_bar = FALSE)
Retrieving data for the year 2022
Code
test_spatial_autocorr <- function(group_name, model) {
  # Extract model data for specific RUCA group
  model_data <- f_model %>% filter(RUCA_group == group_name)
  
  # Load spatial geometries and join with model data to ensure alignment
  spatial_data <- st_read("data.shp", quiet = TRUE) %>% 
    mutate(GEOID = as.character(GEOID)) %>%
    filter(GEOID %in% model_data$GEOID) %>%
    inner_join(model_data, by = "GEOID")
  
  # Extract Pearson residuals from the model
  resids <- residuals(model, type = "pearson")
  
  # Create spatial weights matrix using queen contiguity (shared edge or vertex)
  nb <- poly2nb(st_make_valid(spatial_data), queen = TRUE)
  listw <- nb2listw(nb, style = "W", zero.policy = TRUE)  # Row-standardized weights
  
  # Run spatial autocorrelation tests
  moran <- moran.test(resids, listw, zero.policy = TRUE)  # Moran's I: positive values = clustering
  geary <- geary.test(resids, listw, zero.policy = TRUE)  # Geary's C: values < 1 = clustering
  
  # Create map of residuals with county boundaries overlay
  spatial_data$residuals <- resids
  plot <- ggplot() +
    # Plot residuals (blue = negative, red = positive)
    geom_sf(data = spatial_data, aes(fill = residuals), color = NA) +
    # Overlay county boundaries for geographic context
    geom_sf(data = tx_counties, fill = NA, color = "gray30", size = 0.3, alpha = 0.5) +
    # Diverging color scale centered at zero with symmetric limits
    scale_fill_gradient2(
      low = "#2166AC", mid = "white", high = "#B2182B", 
      midpoint = 0, 
      name = "Pearson\nResiduals",
      limits = c(-max(abs(resids), na.rm = TRUE), max(abs(resids), na.rm = TRUE))
    ) +
    # Add title with Moran's I statistic
    labs(title = paste("Model Residuals:", group_name),
         subtitle = sprintf("Moran's I = %.3f (p = %.4f)", 
                          moran$estimate[1], moran$p.value)) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", size = 14),
      plot.subtitle = element_text(size = 11),
      legend.position = "right",
      panel.grid = element_blank(),
      axis.text = element_blank(),
      axis.ticks = element_blank()
    )
  
  # Return test results and plot
  list(moran = moran, geary = geary, plot = plot)
}

# Run spatial autocorrelation tests for each RUCA group
spatial_results <- map(names(models), ~tryCatch(
  test_spatial_autocorr(.x, models[[.x]]), 
  error = function(e) NULL
)) %>%
  set_names(names(models)) %>%
  compact()

# Display results
cat("\nSpatial Autocorrelation Results:\n")

Spatial Autocorrelation Results:
Code
for (name in names(spatial_results)) {
  res <- spatial_results[[name]]
  cat(sprintf("  %s: Moran's I=%.3f (p=%.4f)\n", 
              name, res$moran$estimate[1], res$moran$p.value))
  print(res$plot)
}
  Isolated Rural: Moran's I=0.012 (p=0.3606)

  Metropolitan: Moran's I=0.020 (p=0.0000)

  Micropolitan: Moran's I=0.048 (p=0.0027)

  Small Rural: Moran's I=0.063 (p=0.0087)

4.2 Interpretation of Spatial Tests

Key Findings:

  • Three groups (Metropolitan, Micropolitan, Small Rural) show statistically significant spatial autocorrelation (p < 0.01)
  • However, all Moran’s I values are small (0.012-0.063), indicating weak spatial dependence
  • With large sample sizes, even trivial spatial effects become statistically significant
  • The practical impact on coefficient estimates is likely minimal

4.3 Combined Residual Map

Code
# Combine residuals from all RUCA groups into a single spatial dataset
all_spatial_data <- map_dfr(names(models), function(name) {
  # Filter model data for this specific RUCA group
  model_data <- f_model %>% filter(RUCA_group == name)
  
  # Load spatial geometries and join with model data
  spatial_data <- st_read("data.shp", quiet = TRUE) %>% 
    mutate(GEOID = as.character(GEOID)) %>%
    filter(GEOID %in% model_data$GEOID) %>%
    inner_join(model_data, by = "GEOID")
  
  # Add residuals from this group's model
  spatial_data$residuals <- residuals(models[[name]], type = "pearson")
  spatial_data$group <- name  # Track which RUCA group each observation belongs to
  
  spatial_data
})

# Create statewide map showing residuals from all models combined
combined_plot <- ggplot() +
  # Plot residuals from all RUCA groups (blue = under-prediction, red = over-prediction)
  geom_sf(data = all_spatial_data, aes(fill = residuals), color = NA) +
  # Overlay Texas county boundaries for geographic reference
  geom_sf(data = tx_counties, fill = NA, color = "gray30", size = 0.3, alpha = 0.5) +
  # Use symmetric diverging color scale centered at zero
  scale_fill_gradient2(
    low = "#2166AC", mid = "white", high = "#B2182B", 
    midpoint = 0,
    name = "Pearson\nResiduals",
    # Set equal positive/negative limits based on maximum absolute residual
    limits = c(-max(abs(all_spatial_data$residuals), na.rm = TRUE), 
               max(abs(all_spatial_data$residuals), na.rm = TRUE))
  ) +
  labs(title = "Model Residuals: All RUCA Groups Combined",
       subtitle = "Spatial distribution of Pearson residuals across Texas") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11),
    legend.position = "right",
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )

print(combined_plot)

Pattern: Positive residuals (under-prediction) concentrate in major urban cores, while negative residuals (over-prediction) appear in peripheral areas.

4.4 Summary Table

Code
if (length(spatial_results) > 0) {
  spatial_summary <- map_dfr(names(spatial_results), function(name) {
    res <- spatial_results[[name]]
    data.frame(
      Group = name,
      Moran_I = round(res$moran$estimate[1], 4),
      Moran_p = round(res$moran$p.value, 4),
      Geary_C = round(res$geary$estimate[1], 4),
      Geary_p = round(res$geary$p.value, 4)
    )
  })
  
  kable(spatial_summary, caption = "Spatial Autocorrelation Test Results") %>%
    kable_styling(c("striped", "hover"))
}
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Spatial Autocorrelation Test Results
Group Moran_I Moran_p Geary_C Geary_p
Moran I statistic...1 Isolated Rural 0.0118 0.3606 0.2469 0.0000
Moran I statistic...2 Metropolitan 0.0200 0.0000 1.3205 1.0000
Moran I statistic...3 Micropolitan 0.0483 0.0027 1.1135 0.9846
Moran I statistic...4 Small Rural 0.0631 0.0087 0.9915 0.4341

5. Walkability Effects

5.1 Marginal Effects Plots

Predicted crash counts across walkability values for each RUCA group, holding other variables at their means.

Code
safe_marginal_plot <- function(model, group_name) {
  tryCatch({
    preds <- ggpredict(model, terms = "NatWalkInd_c [all]", type = "fixed")
    plot(preds) + 
      ggtitle(paste("Marginal Effect of Walkability -", group_name)) +
      xlab("Walkability Index (Centered)") +
      ylab("Predicted Crash Count") +
      theme_minimal() +
      theme(
        plot.title = element_text(size = 12, hjust = 0.5),
        axis.title = element_text(size = 10)
      )
  }, error = function(e) NULL)
}

marginal_plots <- models %>%
  imap(~ safe_marginal_plot(.x, .y)) %>%
  compact()

for (i in seq_along(marginal_plots)) {
  print(marginal_plots[[i]])
}

Interpretation:

  • Metropolitan shows clear positive relationship with acceleration (quadratic effect)
  • Rural groups show flat or negative relationships
  • Walkability does not influence crash risk outside urban areas

6. Regression Results

6.1 Incident Rate Ratios

By exponentiating coefficients, we get IRRs showing multiplicative effects:

  • IRR > 1.0 = increases crashes
  • IRR < 1.0 = decreases crashes
  • IRR = 1.0 = no effect
Code
create_results_table <- function(models) {
  imap_dfr(models, function(model, groupname) {
    tidy(model, conf.int = TRUE, exponentiate = TRUE) %>%
      filter(term != "(Intercept)") %>%
      mutate(
        Group = groupname,
        `IRR (95% CI)` = sprintf("%.3f (%.3f-%.3f)", estimate, conf.low, conf.high),
        `P-value` = ifelse(p.value < 0.001, "<0.001", sprintf("%.3f", p.value)),
        Sig = case_when(
          p.value < 0.001 ~ "***",
          p.value < 0.01 ~ "**",
          p.value < 0.05 ~ "*",
          TRUE ~ ""
        ),
        Variable = recode(term,
          "NatWalkInd_c" = "Walkability (Centered)",
          "I(NatWalkInd_c^2)" = "Walkability² (Centered)",
          "popden" = "Population Density",
          "pct_white" = "Percent White (Std)",
          "pct_pov" = "Percent in Poverty (Std)"
        )
      ) %>%
      select(Group, Variable, `IRR (95% CI)`, `P-value`, Sig)
  })
}

results_table <- create_results_table(models)
Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred

Warning: glm.fit: fitted rates numerically 0 occurred
Code
kable(results_table, 
      caption = "Negative Binomial Regression Results by RUCA Group",
      align = c("l", "l", "c", "c", "c")) %>%
  kable_styling(c("striped", "hover", "condensed"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "Model Results" = 3)) %>%
  footnote(general = "IRR = Incident Rate Ratio; *** p<0.001, ** p<0.01, * p<0.05")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Negative Binomial Regression Results by RUCA Group
Model Results
Group Variable IRR (95% CI) P-value Sig
Isolated Rural Walkability (Centered) 0.914 (0.769-1.054) 0.261
Isolated Rural Walkability² (Centered) 0.989 (0.964-1.012) 0.367
Isolated Rural Population Density 0.981 (0.966-0.992) 0.005 **
Isolated Rural Percent White (Std) 0.876 (0.672-1.165) 0.345
Isolated Rural Percent in Poverty (Std) 0.914 (0.700-1.171) 0.491
Metropolitan Walkability (Centered) 1.176 (1.159-1.193) <0.001 ***
Metropolitan Walkability² (Centered) 1.007 (1.004-1.009) <0.001 ***
Metropolitan Population Density 1.000 (1.000-1.000) <0.001 ***
Metropolitan Percent White (Std) 0.829 (0.794-0.866) <0.001 ***
Metropolitan Percent in Poverty (Std) 1.215 (1.164-1.270) <0.001 ***
Micropolitan Walkability (Centered) 1.053 (0.946-1.181) 0.327
Micropolitan Walkability² (Centered) 0.992 (0.971-1.015) 0.446
Micropolitan Population Density 1.000 (1.000-1.000) 0.240
Micropolitan Percent White (Std) 0.856 (0.664-1.087) 0.178
Micropolitan Percent in Poverty (Std) 1.056 (0.873-1.293) 0.559
Small Rural Walkability (Centered) 0.858 (0.691-1.089) 0.144
Small Rural Walkability² (Centered) 0.972 (0.934-1.013) 0.141
Small Rural Population Density 1.000 (1.000-1.001) 0.523
Small Rural Percent White (Std) 1.074 (0.678-1.632) 0.705
Small Rural Percent in Poverty (Std) 1.047 (0.764-1.468) 0.778
Note:
IRR = Incident Rate Ratio; *** p<0.001, ** p<0.01, * p<0.05

6.2 Key Findings

Metropolitan Areas:

  • IRR = 1.176 for walkability: Each one-unit increase multiplies crash count by 1.176 (17.6% increase)
  • IRR = 1.007 for walkability²: Significant quadratic effect indicates acceleration at higher walkability
  • IRR = 1.215 for poverty: Increases crashes by 21.5%
  • IRR = 0.829 for percent white: Each SD increase multiplies crashes by 0.829 (17.1% decrease)

Rural Areas (Isolated, Micropolitan, Small):

  • No significant walkability effects
  • Very high zero-inflation (80-89%)
  • Minimal population density effects

Urban-Rural Divide:

The walkability-crash relationship differs fundamentally by context, likely reflecting traffic volume and infrastructure differences rather than walkability itself.

7. Discussion and Limitations

7.1 Main Conclusions

Strong urban-rural divide in walkability-crash relationships:

  1. Metropolitan areas: Significant positive associations with non-linearity
  2. Rural areas: No clear effects due to low traffic and crash rates
  3. Spatial autocorrelation: Weak but statistically significant; unlikely to bias conclusions

7.2 Limitations

  • Zero-inflation: 69-89% of areas have zero crashes; zero-inflated models may be more appropriate
  • Exposure measurement: Population imperfectly measures actual traffic exposure
  • Causal interpretation: Positive walkability-crash association may reflect traffic volume rather than safety hazards
  • Spatial dependence: While weak, unmeasured spatial processes may influence results

7.3 Future Directions

  • Explore zero-inflated models for excess zeros
  • Incorporate traffic volume data where available
  • Distinguish pedestrian crashes from all crashes
  • Consider geographically weighted regression for spatial non-stationarity